Improving Data Locality by Chunking
Authors
Abstract
Cache memories were invented to decouple fast processors from slow memories. However, this decoupling is only partial, and many researchers have attempted to improve cache use through program optimization. The potential benefits are significant, since both energy dissipation and performance depend heavily on the traffic between memory levels. Modeling this traffic is difficult, however, which has led to the use of heuristic methods for steering program transformations. In this paper, we propose another approach: we simplify the cache model and organize the target program in such a way that an asymptotic evaluation of the memory traffic becomes possible. Our optimization algorithm uses this information to find the best reordering of the program operations, at least in an asymptotic sense. Our method optimizes both temporal and spatial locality. It can be applied to any static control program with arbitrary dependences. The optimizer has been partially implemented and applied to non-trivial programs. We present experimental evidence that the number of cache misses is drastically reduced, with corresponding performance improvements.
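The effect of reordering operations into chunks can be illustrated with a toy experiment. The sketch below is not the authors' optimizer or cache model; it assumes a fully associative LRU cache and uses a matrix transpose as the example program. The names `count_misses` and `transpose_trace`, and the parameters (a 64 x 64 matrix, 32 cache lines of 8 words each), are hypothetical choices made for illustration only.

```python
from collections import OrderedDict


def count_misses(trace, num_lines, line_size):
    """Count misses of a fully associative LRU cache on an address trace.

    Assumed toy model: `num_lines` lines of `line_size` words each.
    """
    cache = OrderedDict()  # keys are line numbers, ordered from least to most recent
    misses = 0
    for addr in trace:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def transpose_trace(n, tile):
    """Yield the word addresses touched by B[j][i] = A[i][j].

    Both matrices are n x n and row-major; A starts at address 0 and B
    at address n * n. Operations are visited in tile x tile chunks, so
    tile == n reproduces the ordinary row-by-row order.
    """
    a_base, b_base = 0, n * n
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    yield a_base + i * n + j  # read A[i][j]
                    yield b_base + j * n + i  # write B[j][i]


if __name__ == "__main__":
    n, num_lines, line_size = 64, 32, 8
    unchunked = count_misses(transpose_trace(n, n), num_lines, line_size)
    chunked = count_misses(transpose_trace(n, 8), num_lines, line_size)
    print("unchunked misses:", unchunked)
    print("chunked misses:", chunked)
```

Under these assumed parameters, the chunked order touches only 16 distinct cache lines per 8 x 8 chunk, so every line is loaded exactly once, whereas the row-by-row order evicts each line of B before it is reused. Both orders perform the same operations; only their sequence differs, which is the kind of reordering the abstract refers to.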
Similar resources
Control of loop parallelism in multithreaded code
Due to the large amount of potential parallelism, resource management is a critical issue in multithreaded architectures. The challenge in code generation is to control the parallelism without reducing the machine's ability to exploit it. Controlled parallelism reduces idle time, communication, and delay caused by synchronization. At the same time it increases the potential for exploitation of p...
Hierarchical Chunking in Classifier Systems
Two standard schemes for learning in classifier systems have been proposed in the literature: the bucket brigade algorithm (BBA) and the profit sharing plan (PSP). The BBA is a local learning scheme which requires less memory and lower peak computation than the PSP, whereas the PSP is a global learning scheme which typically achieves a clearly better performance than the BBA. This “requirement ...
Gemini: A Computation-Centric Distributed Graph Processing System
Traditionally distributed graph processing systems have largely focused on scalability through the optimizations of inter-node communication and load balance. However, they often deliver unsatisfactory overall processing efficiency compared with shared-memory graph computing frameworks. We analyze the behavior of several graph-parallel systems and find that the added overhead for achieving scal...
Resource Management in Dataflow-Based Multithreaded Execution
Due to the large amount of potential parallelism, resource management is a critical issue in multithreaded execution. The challenge in code generation is to control the parallelism without reducing the machine's ability to exploit it. Controlled parallelism reduces idle time, communication, and delay caused by synchronization. At the same time it increases the potential for exploitation of prog...
Ddelta: A deduplication-inspired fast delta compression approach
Delta compression is an efficient data reduction approach to removing redundancy among similar data chunks and files in storage systems. One of the main challenges facing delta compression is its low encoding speed, a worsening problem in face of the steadily increasing storage and network bandwidth and speed. In this paper, we present Ddelta, a deduplication-inspired fast delta compression sch...